Biostatistics For Dummies, 2nd Edition (Monika Wahi, John Pezzullo)

230 PART 5 Looking for Relationships with Correlation and Regression

Calculating the Sample Size You Need

To estimate how many data points you need for a regression analysis, you need to

ask yourself why you’re doing the regression in the first place:

» Do you want to show that the two variables are statistically significantly

associated? If so, you want to calculate the sample size required to achieve a

certain statistical power for the significance test (see Chapter 3 for an

introduction to statistical power).

» Do you want to estimate the value of the slope (or intercept) to within a

certain margin of error? If so, you want to calculate the sample size required

to achieve a certain precision in your estimate.

Testing the statistical significance of a slope is exactly equivalent to testing the

statistical significance of a correlation coefficient, so the sample-size calculations

are also the same for the two types of tests. If you haven’t already, check out

Chapter 15, which contains guidance and formulas to estimate how many

participants you need to test for any specified degree of correlation.
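
As a rough illustration of this kind of calculation, the sketch below uses one widely used approximation based on the Fisher z transformation of the correlation coefficient. It is not necessarily the formula given in Chapter 15. The function name n_for_correlation and the default settings (two-sided alpha of 0.05, 80 percent power) are assumptions made for this example.

```python
import math
from statistics import NormalDist


def n_for_correlation(r, alpha=0.05, power=0.80):
    """Approximate sample size needed to detect a true correlation r.

    Assumes a two-sided test of "correlation = 0" and the Fisher z
    approximation: n = ((z_alpha + z_beta) / atanh(r))**2 + 3.
    """
    if not 0 < abs(r) < 1:
        raise ValueError("r must be nonzero and strictly between -1 and 1")
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # critical value for alpha
    z_beta = NormalDist().inv_cdf(power)           # value for desired power
    fisher_z = math.atanh(r)                       # 0.5 * ln((1 + r) / (1 - r))
    return math.ceil(((z_alpha + z_beta) / fisher_z) ** 2 + 3)


if __name__ == "__main__":
    # Weaker correlations need many more participants than stronger ones.
    for r in (0.1, 0.3, 0.5):
        print(f"r = {r}: about {n_for_correlation(r)} participants")
```

Because the slope test and the correlation test are equivalent in straight-line regression, the same estimate applies to testing whether the slope differs from zero.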

If you’re using regression to estimate the value of a regression coefficient — for

example, the slope of the straight line — then the sample-size calculations

become more complicated. The precision of the slope depends on several factors:

» The number of data points: More data points give you greater precision. Standard

errors (SEs) vary inversely with the square root of the sample size. Equivalently, the

required sample size varies inversely with the square of the desired SE. So, if

you quadruple the sample size, you cut the SE in half. This is a very important

and generally applicable principle.

» Tightness of the fit of the observed points to the line: The closer the data

points hug the line, the more precisely you can estimate the regression

coefficients. The effect is directly proportional, in that twice as much Y-scatter

of the points produces twice as large an SE in the coefficients.

» How the data points are distributed across the range of the X variable:

This effect is hard to quantify, but in general, having the data points spread

out evenly over the entire range of X produces more precision than having

most of them clustered near the middle of the range.
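
The three factors above can be seen at work in the standard formula for the SE of the slope in straight-line regression: the residual standard deviation (the Y-scatter around the line) divided by the square root of the sum of squared deviations of X from its mean. The sketch below is only an illustration. The X values, the residual SD of 2, and the helper names slope_se and required_n are all hypothetical choices made for this example.

```python
import math


def slope_se(x, residual_sd):
    """SE of the slope: residual_sd / sqrt(sum((x - mean(x))**2))."""
    mean_x = sum(x) / len(x)
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    return residual_sd / math.sqrt(sxx)


def required_n(n_current, se_current, se_target):
    """Scale a sample size using the rule that n varies as 1 / SE**2.

    Assumes the X distribution and the Y-scatter stay the same.
    """
    return math.ceil(n_current * (se_current / se_target) ** 2)


# Hypothetical design: 20 X values spread evenly from 0 to 10.
even_20 = [10 * i / 19 for i in range(20)]
base = slope_se(even_20, residual_sd=2.0)

# Factor 1: four times as many points (same X pattern) halves the SE.
quadrupled = slope_se(even_20 * 4, residual_sd=2.0)

# Factor 2: twice as much Y-scatter doubles the SE.
noisier = slope_se(even_20, residual_sd=4.0)

# Factor 3: same n and same range, but 18 of the 20 points sit between
# 4 and 6, which gives a larger SE than the evenly spread design.
clustered = [0.0, 10.0] + [4.0 + 2 * i / 17 for i in range(18)]
clustered_se = slope_se(clustered, residual_sd=2.0)

if __name__ == "__main__":
    print(f"baseline SE:        {base:.4f}")
    print(f"quadrupled n:       {quadrupled:.4f}")
    print(f"doubled scatter:    {noisier:.4f}")
    print(f"clustered X values: {clustered_se:.4f}")
    print("n needed to halve the baseline SE:",
          required_n(20, base, base / 2))
```

The required_n helper simply restates the inverse-square rule from the first bullet, so it is only as reliable as the assumption that the new data will have the same X spread and Y-scatter as the data that produced the current SE.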

Given these factors, how do you strategically design a study and gather data for a

linear regression where you’re mainly interested in estimating a regression

coefficient to within a certain precision? One practical approach is to first conduct a

study that is small and underpowered, called a pilot study, to estimate the SE of the